Chapter 13 GWR and spatially lagged regression
13.1 Learning objectives
By the end of this practical you should be able to:
- Explain hypothesis testing
- Execute regression in R
- Describe the assumptions associated with regression models
- Explain steps to deal with spatially autocorrelated (spatial similarity of nearby observations) residuals.
13.2 Homework
Outside of our scheduled sessions you should be doing around 12 hours of extra study per week. Feel free to follow your own GIS interests, but good places to start include the following:
Assignment
From weeks 6-9, learn and practice analysis from the course and identify appropriate techniques (from wider research) that might be applicable/relevant to your data. Conduct an extensive methodological review – this could include analysis from within academic literature and/or government departments (or any reputable source).
Reading This week:
Chapter 2 “Linear Regression” from Hands-On Machine Learning with R by Boehmke & Greenwell (2020).
Chapters 5 and 6, “Basic Regression” and “Multiple Regression”, from Modern Dive by Ismay and Kim (2019).
Remember this is just a starting point, explore the reading list, practical and lecture for more ideas.
13.3 Recommended listening
Some of these practicals are long, take regular breaks and have a listen to some of our fav tunes each week.
Adam This week it’s Radio1’s DnB 60, recorded live at the Warehouse Project. Die, Bryan G, Jumpin’ Jack Frost and it sounds like Dynamite on the mic — some proper rollers in here!
13.4 Introduction
In this practical you will be introduced to a suite of different models that will allow you to test a variety of research questions and hypotheses through modelling the associations between two or more spatially referenced variables.
In the worked example, we will explore the factors that might affect the average exam scores of 16 year-olds across London. GCSEs are the exams taken at the end of secondary education and here have been aggregated for all pupils at their home addresses across the City for Ward geographies.
The London Data Store collates a range of other variables for each Ward and so we will see if any of these are able to help explain the patterns of exam performance that we see.
This practical will walk you through the common steps that you should go through when building a regression model using spatial data to test a stated research hypothesis; from carrying out some descriptive visualisation and summary statistics, to interpreting the results and using the outputs of the model to inform your next steps.
13.4.1 Setting up your Data
First, let’s set up R and read in some data to enable us to carry out our analysis.
#library a bunch of packages we may (or may not) use - install them first if not installed already.
library(tidyverse)
library(tmap)
library(geojsonio)
library(plotly)
library(rgdal)
library(broom)
library(mapview)
library(crosstalk)
library(sf)
library(sp)
library(spdep)
library(car)
Now read some ward data in:
#download a zip file containing some boundaries we want to use
download.file("https://data.london.gov.uk/download/statistical-gis-boundary-files-london/9ba8c833-6370-4b11-abdc-314aa020d5e0/statistical-gis-boundaries-london.zip", destfile="prac9_data/statistical-gis-boundaries-london.zip")
#unzip it
unzip("prac9_data/statistical-gis-boundaries-london.zip", exdir="prac9_data")
#look at the subfolders in the directory to work out where we want to get our shapefile from
list.dirs("prac9_data/statistical-gis-boundaries-london")
## [1] "prac9_data/statistical-gis-boundaries-london"
## [2] "prac9_data/statistical-gis-boundaries-london/ESRI"
## [3] "prac9_data/statistical-gis-boundaries-london/MapInfo"
#read in the boundaries from the file you have just unzipped into your working directory
LondonWardsss <- readOGR("prac9_data/statistical-gis-boundaries-london/ESRI/London_Ward_CityMerged.shp", layer="London_Ward_CityMerged")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\Andy\OneDrive - University College London\Teaching\CASA0005repo\prac9_data\statistical-gis-boundaries-london\ESRI\London_Ward_CityMerged.shp", layer: "London_Ward_CityMerged"
## with 625 features
## It has 7 fields
## Integer64 fields read as strings: POLY_ID
#convert it to a simple features object
LondonWardsssSF <- st_as_sf(LondonWardsss)
#check coordinate reference system
LondonWardsssSF
## Simple feature collection with 625 features and 7 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601272 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.1502,0.247,0.8421,-20.4894 +units=m +no_defs
## First 10 features:
## NAME GSS_CODE HECTARES NONLD_AREA LB_GSS_CD
## 0 Chessington South E05000405 755.173 0 E09000021
## 1 Tolworth and Hook Rise E05000414 259.464 0 E09000021
## 2 Berrylands E05000401 145.390 0 E09000021
## 3 Alexandra E05000400 268.506 0 E09000021
## 4 Beverley E05000402 187.821 0 E09000021
## 5 Coombe Hill E05000406 442.170 0 E09000021
## 6 Chessington North and Hook E05000404 192.980 0 E09000021
## 7 Surbiton Hill E05000413 166.482 0 E09000021
## 8 Old Malden E05000410 180.016 0 E09000021
## 9 St. Mark's E05000412 137.578 0 E09000021
## BOROUGH POLY_ID geometry
## 0 Kingston upon Thames 50840 POLYGON ((516401.6 160201.8...
## 1 Kingston upon Thames 117160 POLYGON ((517829.6 165447.1...
## 2 Kingston upon Thames 50449 POLYGON ((518107.5 167303.4...
## 3 Kingston upon Thames 50456 POLYGON ((520480 166909.8, ...
## 4 Kingston upon Thames 117161 POLYGON ((522071 168144.9, ...
## 5 Kingston upon Thames 117159 POLYGON ((522007.6 169297.3...
## 6 Kingston upon Thames 50530 POLYGON ((517175.3 164077.3...
## 7 Kingston upon Thames 50457 POLYGON ((517469.3 166878.5...
## 8 Kingston upon Thames 50455 POLYGON ((522231.1 166015, ...
## 9 Kingston upon Thames 50450 POLYGON ((517460.6 167802.9...
#the proj4string shows British National Grid parameters but no EPSG code is set, so transform to BNG to make the CRS explicit
BNG = "+init=epsg:27700"
LondonWardsssSFBNG <- st_transform(LondonWardsssSF, BNG)
#check the data
qtm(LondonWardsssSFBNG)
Now we are going to read in some data from the London Data Store - https://data.london.gov.uk/
#read in some attribute data
LondonWardProfiles <- read_csv("https://data.london.gov.uk/download/ward-profiles-and-atlas/772d2d64-e8c6-46cb-86f9-e52b4c7851bc/ward-profiles-excel-version.csv", col_names = TRUE, locale = locale(encoding = 'Latin1'))
13.4.1.1 Cleaning the data as you read it in
Examining the dataset as it is read in above, you can see that a number of fields in the dataset that should have been read in as numeric data, have actually been read in as character (text) data.
If you examine your data file, you will see why. In a number of columns where data are missing, rather than a blank cell, the values ‘n/a’ have been entered in instead. Where these text values appear amongst numbers, the software will automatically assume the whole column is text.
To deal with these errors, we can force read_csv to ignore these values by telling it what values to look out for that indicate missing data:
#We can use readr to deal with the issues in this dataset - which are to do with text values being stored in columns containing numeric values
#read in some data - couple of things here. Read in specifying a load of likely 'n/a' values, also specify Latin1 as encoding as there is a pound sign (£) in one of the column headers - just to make things fun!
LondonWardProfiles <- read_csv("https://data.london.gov.uk/download/ward-profiles-and-atlas/772d2d64-e8c6-46cb-86f9-e52b4c7851bc/ward-profiles-excel-version.csv", na = c("", "NA", "n/a"), locale = locale(encoding = 'Latin1'), col_names = TRUE)
Now you have read in both your boundary data and your attribute data, you need to merge the two together using a common ID. In this case, we can use the ward codes to achieve the join.
#merge boundaries and data
LonWardProfiles <- left_join(LondonWardsssSFBNG, LondonWardProfiles, by = c("GSS_CODE" = "New code"))
## Warning: Column `GSS_CODE`/`New code` joining factor and character vector,
## coercing into character vector
#let's map our dependent variable to see if the join has worked:
tmap_mode("view")
qtm(LonWardProfiles, fill = "Average GCSE capped point scores - 2014", borders = NULL)
13.4.1.2 Additional Data
In addition to our main datasets, it might also be useful to add some contextual data. While our exam results have been recorded at the home address of students, most students would have attended one of the schools in the City.
Let’s add some schools data as well.
#might be a good idea to see where the secondary schools are in London too
london_schools <- read_csv("https://data.london.gov.uk/download/london-schools-atlas/57046151-39a0-45d9-8dc0-27ea7fd02de8/all_schools_xy_2016.csv")
#from the coordinate values stored in the x and y columns, which look like they are latitude and longitude values, create a new points dataset
lon_schools_sf <- st_as_sf(london_schools, coords = c("x","y"), crs = 4326)
#now pull out the secondary schools
#these are the same - one uses grep() and one uses the stringr package
lond_sec_schools_sf <- lon_schools_sf[str_which(lon_schools_sf[["PHASE"]],"Secondary"),]
lond_sec_schools_sf <- lon_schools_sf[grep("Secondary",lon_schools_sf[["PHASE"]]),]
tmap_mode("view")
qtm(lond_sec_schools_sf)
13.5 Analysing GCSE exam performance - testing a research hypothesis
To explore the factors that might influence GCSE exam performance in London, we are going to run a series of different regression models. A regression model is simply the expression of a linear relationship between our outcome variable (Average GCSE score in each Ward in London) and another variable or several variables that might explain this outcome.
13.5.1 Research Question and Hypothesis
Examining the spatial distribution of GCSE point scores in the map above, it is clear that there is variation across the city. My research question is:
What are the factors that might lead to variation in Average GCSE point scores across the city?
My research hypothesis that I am going to test is that there are other observable factors occurring in Wards in London that might affect the average GCSE scores of students living in those areas.
In inferential statistics, we cannot definitively prove a hypothesis is true, but we can seek to disprove that there is absolutely nothing of interest occurring or no association between variables. The null hypothesis that I am going to test empirically with some models is that there is no relationship between exam scores and other observed variables across London.
13.5.2 Regression Basics
For those of you who know a bit about regression, you might want to skip down to the next section. However, if you are new to regression or would like a refresher, read on…
The linear relationship in a regression model is probably most easily explained using a scatter plot:
q <- qplot(x = `Unauthorised Absence in All Schools (%) - 2013`, y = `Average GCSE capped point scores - 2014`, data=LonWardProfiles)
#plot with a regression line - note, I've added some jitter here as the x-scale is rounded
q + stat_smooth(method="lm", se=FALSE, size=1) + geom_jitter()
Here, I have plotted the average GCSE point score for each Ward in London against another variable in the dataset that I think might be influential: the % of school days lost to unauthorised absences in each ward.
Remember that my null hypothesis would be that there is no relationship between GCSE scores and unauthorised absence from school. If this null hypothesis was true, then I would not expect to see any pattern in the cloud of points plotted above.
As it is, the scatter plot shows that, generally, as the \(x\) axis independent variable (unauthorised absence) goes up, our \(y\) axis dependent variable (GCSE point score) goes down. This is not a random cloud of points, but something that indicates there could be a relationship here and so I might be looking to reject my null hypothesis.
Some conventions - In a regression equation, the dependent variable is always labelled \(y\) and shown on the \(y\) axis of a graph, the predictor or independent variable(s) is(are) always shown on the \(x\) axis.
I have added a blue line of best-fit - the line that minimises the sum of the squared differences between the observed points and the line itself. These vertical differences between each point and the line are known as the residuals. The method used to find this line is called ‘ordinary least squares’ (OLS): of all the possible straight lines that could be drawn through the cloud of points, it selects the one for which the sum of the squared deviations between the points and the line is smallest.
As a general rule, the better the blue line is at summarising the relationship between \(y\) and \(x\), the better the model.
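The idea of minimising squared deviations can be sketched with a few lines of R. This is a toy illustration using made-up x and y values (not the GCSE data): for simple regression, the OLS estimates have a closed form, and they match what lm() finds.

```r
# toy data standing in for absence (x) and exam score (y) - made-up values
x <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
y <- c(360, 345, 330, 290, 280, 250)

# closed-form OLS estimates for simple regression:
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
beta1 <- cov(x, y) / var(x)
beta0 <- mean(y) - beta1 * mean(x)

# lm() minimises the sum of squared residuals and recovers the same line
coef(lm(y ~ x))
```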
The equation for the blue line in the graph above can be written:
\[y_i = \beta_0 + \beta_1x_i + \epsilon_i \tag{1}\]
where:
\(\beta_0\) is the intercept (the value of \(y\) when \(x = 0\) - somewhere around 370 on the graph above);
\(\beta_1\) is sometimes referred to as the ‘slope’ parameter and is simply the change in the value of \(y\) for a 1 unit change in the value of \(x\) (the slope of the blue line) - reading the graph above, the change in the value of \(y\) reading between 1.0 and 2.0 on the \(x\) axis looks to be around -40.
\(\epsilon_i\) is a random error term (positive or negative) - essentially, if you add up all of the vertical differences (residuals) between the points and the blue line, they should sum to 0.
Any value of \(y\) along the blue line can be modelled using the corresponding value of \(x\) and these parameter values. Examining the graph above we would expect the average GCSE point score for a student living in a Ward where 0.5% of school days per year were missed, to equal around 350, but we can confirm this by plugging the \(\beta\) parameter values and the value of \(x\) into equation (1):
## [1] 350
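The code chunk that produced the `## [1] 350` output above isn’t shown; a minimal sketch, assuming the values read off the graph (\(\beta_0 \approx 370\), \(\beta_1 \approx -40\)) rather than the fitted coefficients, would be:

```r
# parameter values eyeballed from the graph, not taken from a fitted model
beta0 <- 370
beta1 <- -40

# predicted GCSE score where 0.5% of school days are missed (equation 1)
beta0 + (beta1 * 0.5)
# [1] 350
```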
13.5.3 Running a Regression Model in R
In the graph above, I used a method called ‘lm’ in the stat_smooth function in ggplot2 to draw the regression line. ‘lm’ stands for ‘linear model’ and is a standard function in R for running linear regression models. Use the help system to find out more about lm - ?lm
Below is the code that could be used to draw the blue line in our scatter plot. Note, the tilde ~ symbol means “is modelled by”
#run the linear regression model and store its outputs in an object called model1
model1 <- lm(`Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013`, data = LonWardProfiles)
#show the summary of those outputs
summary(model1)
##
## Call:
## lm(formula = `Average GCSE capped point scores - 2014` ~ `Unauthorised Absence in All Schools (%) - 2013`,
## data = LonWardProfiles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.753 -10.223 -1.063 8.547 61.842
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 371.471 2.165 171.6
## `Unauthorised Absence in All Schools (%) - 2013` -41.237 1.927 -21.4
## Pr(>|t|)
## (Intercept) <2e-16 ***
## `Unauthorised Absence in All Schools (%) - 2013` <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.39 on 624 degrees of freedom
## Multiple R-squared: 0.4233, Adjusted R-squared: 0.4224
## F-statistic: 458 on 1 and 624 DF, p-value: < 2.2e-16
13.5.3.1 Interpreting and using the model outputs
In running a regression model, we are effectively trying to test (disprove) our null hypothesis. If our null hypothesis were true, then we would expect our coefficients to equal 0.
In the output summary of the model above, there are a number of features you should pay attention to:
Coefficient Estimates - these are the \(\beta_0\) (intercept) and \(\beta_1\) (slope) parameter estimates from Equation 1. You will notice that at \(\beta_0 = 371.471\) and \(\beta_1 = -41.237\) they are pretty close to the estimates of 370 and -40 that we read from the graph earlier, but more precise.
Coefficient Standard Errors - these measure the uncertainty in each coefficient estimate: roughly, how much the estimate would be expected to vary if the model were fitted to repeated samples. So, while the model estimates that a 1% increase in unauthorised absence from school is associated with a drop of about 41.2 GCSE points, that estimate varies, on average, by about 1.9 points. As a rule of thumb, we are looking for a standard error that is small relative to the size of the coefficient.
Coefficient t-value - this is the value of the coefficient divided by the standard error and so can be thought of as a kind of standardised coefficient value. The larger the value (either positive or negative), the greater the relative effect that particular independent variable is having on the dependent variable (this is perhaps more useful when we have several independent variables in the model).
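We can check this arithmetic against the summary above. The estimate and standard error below are copied from the printed output (so they are rounded values, not the full-precision ones R holds internally):

```r
# slope estimate and standard error from the model summary above
estimate  <- -41.237
std_error <- 1.927

# t value = estimate / standard error
round(estimate / std_error, 1)
# [1] -21.4
```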
Coefficient p-value - Pr(>|t|) - the p-value is a measure of significance. There is lots of debate about p-values which I won’t go into here, but essentially it refers to the probability of getting a coefficient as large as the one observed in a set of random data. p-values can be thought of as percentages, so if we have a p-value of 0.05, then there is a 5% chance that our coefficient could have occurred in random data, or, put another way, a 95% chance that our coefficient reflects a genuine association in our data.
As a rule of thumb, the smaller the p-value, the more significant that variable is in the story and the smaller the chance that the relationship being observed is just random. Generally, statisticians use 5% or 0.05 as the acceptable cut-off for statistical significance - anything greater than that we should be a little sceptical about.
In R the codes ***, **, *, . are used to indicate significance. We generally want at least a single * next to our coefficient for it to be worth considering.
R-Squared - This can be thought of as an indication of how good your model is - a measure of ‘goodness-of-fit’ (of which there are a number of others). \(r^2\) is quite an intuitive measure of fit as it ranges between 0 and 1 and can be thought of as the % of variation in the dependent variable (in our case GCSE score) explained by variation in the independent variable(s). In our example, an \(r^2\) value of 0.42 indicates that around 42% of the variation in GCSE scores can be explained by variation in unauthorised absence from school. In other words, this is quite a good model. The \(r^2\) value will increase as more independent explanatory variables are added into the model, so where this might be an issue, the adjusted r-squared value can be used to account for this effect.
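The adjustment can be written out directly. Using the values reported in the summary above, and recovering the number of observations from the residual degrees of freedom (624) plus the 2 estimated coefficients:

```r
# adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n  <- 626      # observations: 624 residual df + 2 estimated coefficients
k  <- 1        # number of predictors in the model
r2 <- 0.4233   # multiple R-squared from the summary above

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)
# [1] 0.4224
```

This matches the adjusted R-squared reported by summary(); with only one predictor the penalty is tiny, but it grows as more variables are added.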
13.5.3.2 broom
The output from the linear regression model is messy, and like all things R, mess can be tidied - in this case with a broom! Or rather the broom package, which is also part of tidymodels.
Here, let’s load broom and tidy our output... you will need to install either tidymodels or broom.
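The chunk that produced the tibble below isn’t shown; assuming the model1 object fitted earlier, it would be something like:

```r
library(broom)
# tidy() turns the messy summary() output into a one-row-per-term tibble
tidy(model1)
```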
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 371. 2.16 172. 0.
## 2 `Unauthorised Absence in All Schools (%~ -41.2 1.93 -21.4 1.27e-76
We can also use glance() from broom to get a bit more information, such as \(r^2\) and the adjusted r-squared value.
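Again assuming the model1 object from earlier, the missing chunk would be along the lines of:

```r
library(broom)
# glance() returns a single row of model-level statistics (r.squared, AIC, etc.)
glance(model1)
```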
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.423 0.422 16.4 458. 1.27e-76 2 -2638. 5282. 5296.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>